Actuarial Applications of Natural Language Processing Using Transformers

A Case Study for Processing Text Features in an Actuarial Context

Part I – Introduction and Case Studies on Car Accident Descriptions

By Andreas Troxler, June 2022

An abundant amount of information is available to insurance companies in the form of text. However, language data is unstructured, sometimes multilingual, and single words or phrases taken out of context can be highly ambiguous. With the help of transformer models, text data can be converted into structured data and then used as input to predictive models.

In this Part I of the tutorial, you will discover the use of transformer models for text classification. Throughout this tutorial, the HuggingFace Transformers library will be used.

This notebook serves as a companion to the tutorial "Actuarial Applications of Natural Language Processing Using Transformers". The tutorial explains the underlying concepts, and this notebook illustrates the implementation. This tutorial, the dataset and the notebooks are available on GitHub.

After completing this tutorial, you will know:

Let’s get started.

Notebook Overview

This notebook is divided into seven parts; they are:

  1. Introduction

    1.1 Prerequisites

    1.2 Exploring the data

  2. A brief introduction to the HuggingFace ecosystem

    2.1 Loading the data into a Dataset

    2.2 Tokenization – splitting the raw text

    2.3 The transformer model

  3. Using transformers to extract features for classification or regression tasks

    3.1 Extracting the encoded text ...

    3.2 ... and using it in a classification model

    3.3 Case study: use accident descriptions to predict the number of vehicles involved

    3.4 Cross-lingual transfer

    3.5 Multi-lingual training

  4. Fine-tuning – improving the model

    4.1. Domain-specific finetuning

    4.2. Task-specific finetuning

  5. Understand prediction errors and interpret predictions

    5.1. Case study: use accident descriptions to identify bodily injury

    5.2. Investigate false positives and false negatives

    5.3. Use Captum and transformers-interpret to interpret predictions

  6. Using extractive question answering to process longer texts

  7. Conclusion

1. Introduction

1.1. Prerequisites

Computing Power

This notebook is computationally intensive. We recommend using a platform with GPU support.

We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).

Please note that the results may not be reproducible across platforms and versions.

Local files

Make sure the following files are available in the directory of the notebook:

This notebook will create the following subdirectories:

Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook.

In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.

Importing Required Libraries

The following libraries are required:

In addition, we require openpyxl to enable export from Pandas to Excel.

1.2. Exploring the Data

The data used throughout this tutorial is derived from data of a vehicle crash causation study performed in the United States from 2005 to 2007. The dataset has almost 7'000 records, each relating to one accident. For each case, a verbal description of the accident is available in English, which summarizes road and weather conditions, vehicles, drivers and passengers involved, preconditions, injury severities, etc. The same information is also encoded in tabular form, so that we can apply supervised learning techniques to train the NLP models and compare the information extracted from the verbal descriptions with the encoded data.

The original data consists of multiple tables. For this tutorial, we have aggregated it into a single dataset and added German translations of the English accident descriptions. The translations were generated using the new DeepL Python API.

To explore the data, let's load it into a Pandas DataFrame and examine its shape, columns and data types:
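This exploration step can be sketched as follows; the two-row frame below is only a stand-in for the real dataset, which the notebook reads from the local data file:

```python
import pandas as pd

# Stand-in frame with the same columns as the tutorial dataset;
# in the notebook, df is read from the local data file instead.
df = pd.DataFrame({
    "SCASEID": [1, 2],
    "SUMMARY_EN": ["V1 was traveling north ...", "V1 was stopped ..."],
    "SUMMARY_GE": ["V1 fuhr nach Norden ...", "V1 stand ..."],
    "NUMTOTV": [2, 1],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.dtypes)            # data type of each column
```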

The column SCASEID is a unique case identifier.

The columns SUMMARY_EN and SUMMARY_GE are strings representing the verbal descriptions of the accident in English and German, respectively.

NUMTOTV is the number of vehicles involved in the case. Let's have a look at the distribution of this feature:

Most cases involve two vehicles, and only very few accidents involve more than three vehicles.

Each of the columns WEATHER1 to WEATHER8 indicates the presence of a specific weather condition (1: weather condition present, 9999: presence of weather condition unknown, 0 otherwise):

column meaning count
WEATHER1 cloudy 1112
WEATHER2 snow 114
WEATHER3 fog, smog, smoke 28
WEATHER4 rain 624
WEATHER5 sleet, hail (freezing drizzle or rain) 25
WEATHER6 blowing snow 38
WEATHER7 severe crosswinds 20
WEATHER8 other 25

These weather conditions are not mutually exclusive, i.e., more than one condition can be present in a single case. The frequency distribution looks as follows:

The most frequently recorded weather conditions are "cloudy" (WEATHER1) and "rain" (WEATHER4).

INJSEVA indicates the most serious sustained injury in the accident. For instance, if one person was not injured, and another person suffered a non-incapacitating injury, injury class 2 was assigned to the case.

Information on injury severity has been taken from police accident reports, which are not available in the data. Unfortunately, this information does not necessarily align with the case description: There are many cases for which the case description indicates the presence of an injury, but INJSEVA does not, and vice versa.

For this reason, we manually created an additional column INJSEVB based on the case description, to indicate the presence of a (possible) bodily injury. The table below shows the distribution of the number of cases by the two variables.

INJSEVA meaning INJSEVB=0 INJSEVB=1 total
0 O - No injury 1'458 96 1'554
1 C - Possible injury 1'112 1'298 2'410
2 B - Non-incapacitating injury 729 945 1'674
3 A - Incapacitating injury 304 373 677
4 K - Killed 5 114 119
5 U - Injury, severity unknown 44 122 166
6 Died prior to crash 0 0 0
9 Unknown if injured 51 16 67
10 No person in crash 1 0 1
11 No PAR (police accident report) obtained 231 50 281
Total 3'935 3'014 6'949

Now we turn to the verbal accident descriptions. First, we examine the length of the English texts, SUMMARY_EN. To this end, we split the texts into words, with blank spaces as separator, and show a box plot of the text length by number of vehicles involved in the accident:
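The word-count step (without the box plot) can be sketched like this; the two sample texts are placeholders:

```python
import pandas as pd

df = pd.DataFrame({
    "SUMMARY_EN": ["V1 was traveling north on a two-lane road", "V1 stopped"],
    "NUMTOTV": [2, 1],
})

# split each text on blank spaces and count the resulting words
df["n_words"] = df["SUMMARY_EN"].str.split(" ").str.len()

# text length statistics by number of vehicles involved
# (the notebook visualizes this as a box plot instead)
print(df.groupby("NUMTOTV")["n_words"].describe())
```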

Not surprisingly, the length of the descriptions correlates with the number of vehicles involved.

The average length is above 400 words. As we will see later in this notebook, this poses some challenges with the NLP models that we are using in this notebook, because these are limited to text up to a length of 512 so-called "tokens" (vocabulary items). Since a single word may be tokenized into more than one token, some accident descriptions will be truncated.

Let's examine one of the English texts and its German translation:

To get an impression of the most frequent words, we generate a simple word cloud from all English case descriptions. By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.), which are the most common words and do not add much information to the text.

2. A Brief Introduction to the HuggingFace Ecosystem

This tutorial uses NLP models provided by HuggingFace.

HuggingFace is a community that builds, trains and deploys state-of-the-art models for natural language processing, audio, computer vision etc. HuggingFace's model hub provides thousands of pre-trained models for these applications. The Transformers library offers functionality to quickly download and use those pre-trained models on a given input, fine-tune them on your own datasets and then share them with the community. The library is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow.

In this notebook, the following elements of the HuggingFace ecosystem will be used:

In the next sections we will briefly explore the first three components in turn. The trainer functionality will be used in Section 4 of this notebook.

2.1. Loading the Data into a Dataset

Datasets is a library for easily accessing and sharing datasets and evaluating metrics for NLP, computer vision, and audio tasks.

A dataset can be loaded in a single line of code, in our case directly from the pandas DataFrame. At the same time, we split the dataset into a training (80%) and a test dataset (20%). We fix the random seed for the sake of reproducibility.

Since the texts are relatively long, some parts of this notebook require substantial computing resources. Uncomment the following line to reduce the size of the dataset.

The resulting DatasetDict behaves like a Python dictionary. Therefore, you can access the Dataset corresponding to each split by

The Dataset object behaves like a normal Python container. You can query its length, get rows or columns, etc. For instance, its length is:

To query a single row, you can use its index, like in a list: ds_train[0]. This returns a dictionary representing the row. Its elements can be accessed by the column names as keys, e.g. ds_train[0]["SCASEID"]. Multiple rows can be accessed by index slices, e.g. dataset["train"][:2], or by a list of indices, e.g. dataset["train"][[0, 2]].

You can list the column names and get their detailed types (called features):

Later in this tutorial we will get to know methods to process datasets, such as filtering the rows based on conditions, and processing the data in each row.

2.2 Tokenization: Split Raw Text into Vocabulary Items

Next, we convert the summary texts into tokens, i.e., the text strings are split into elements of the vocabulary of the NLP model.

As such, the tokenizer and the NLP model need to be aligned. Changing the tokenizer after training the model would produce unpredictable results.

Let's start with the model distilbert-base-multilingual-cased. As the name implies, this model is cased: it does make a difference between "english" and "English".

The model is trained on the concatenation of Wikipedia in 104 different languages listed here. The model has 6 layers, 768 dimensions and 12 heads, totaling 134 million parameters. This model is a distilled version of the BERT base multilingual model which has 177 million parameters. On average, the distilled model is twice as fast as the original model.

If you want to use another model throughout this notebook, please feel free to simply change the following line!

As we can see, the tokenizer has a vocabulary of size 119'547. The maximum sequence length of the model is 512 tokens.

To see the tokenizer in action, we tokenize the first sentence of an accident description:

Calling the tokenizer returns a BatchEncoding object, which behaves just like a standard Python dictionary that holds input items used by the NLP model. input_ids is the list of token IDs for each token. attention_mask is a list containing 1 for all elements that correspond to tokens of the input text, and 0 for padding tokens that are appended to attain a specified sequence length.

To illustrate the meaning of the input IDs, we convert them back to token strings:

We observe that words like "V1", "Pontiac", "minivan", "driveway" etc. are split into multiple tokens each. This is typical for WordPiece tokenization adopted by BERT, an approach designed to reduce vocabulary size. This tokenizer marks sub-words by the prefix ##.
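To build intuition for why words like "driveway" become two tokens, here is a minimal sketch of greedy longest-match-first WordPiece splitting with a toy vocabulary; it is not the actual DistilBERT tokenizer, which works the same way but with its full vocabulary:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split of a single word.

    Sub-word pieces (all but the first) carry the '##' prefix, as in
    BERT. Returns ['[UNK]'] if the word cannot be covered."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"drive", "##way", "mini", "##van"}
print(wordpiece("driveway", vocab))  # ['drive', '##way']
print(wordpiece("minivan", vocab))   # ['mini', '##van']
```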

It is interesting to note that 2000 is a separate element of the vocabulary.

The first and last tokens of the tokenized sequence are CLS and SEP, respectively.

Here is a list of other special tokens used by the BERT tokenizer:

It is instructive to look at the tokenization of the German translation of the same text:

Tokenizers of multi-lingual models use the same vocabulary for all languages. Obviously, the tokenizer simply splits the input string into pieces and does not perform any translation: the English pronoun "a" (169) is a different token than the equivalent German "ein" (10290).

We observe that the tokenizer is case-sensitive: It differentiates between the tokens mini (25103) and Mini (32930).

So far, we have tokenized single sentences only. Next, we want to tokenize the entire dataset. This is easily achieved by applying the map function to the dataset.

All we need to provide to the map function is a function that takes a record or a batch of records from the dataset, applies an operation to it, and returns a Dataset or a dict which defines the columns to be added or updated.

In our case, we supply a function that calls the tokenizer as shown before. As we have seen, calling the tokenizer returns a dict with the keys input_ids and attention_mask. Therefore, the map function will add columns with these names to the original dataset.

Since we plan to feed the tokenized sequences into a transformer model, we need to truncate their length to the maximum length accepted by the transformer. Moreover, the shorter sequences need to be padded at the end, so that all tokenized sequences have the same length.
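Truncation and padding can be illustrated in isolation with a small helper (truncate_and_pad is a hypothetical name, not part of the Transformers API; the real tokenizer performs both steps internally when called with truncation=True and padding="max_length"):

```python
def truncate_and_pad(input_ids, max_length, pad_id=0):
    """Truncate a token-ID list to max_length and pad shorter lists.

    Returns the fixed-length IDs plus the matching attention mask
    (1 for real tokens, 0 for padding)."""
    ids = input_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = truncate_and_pad([101, 159, 10700, 102], max_length=6)
print(ids)   # [101, 159, 10700, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```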

Overall, only a few lines of code are required to complete the tokenization:

The additional argument column is passed to tokenize via the dictionary fn_kwargs. As we can see from the progress bars, the map function gets called twice - once for each split. As expected, new columns input_ids and attention_mask have been added to the dataset.

We repeat the same procedure for the German texts.

Later on, we will also use a dataset which has 80% English texts and 20% German texts:

Now we have created three datasets - with the tokenized English, German and mixed language texts, respectively.

We could have stored the results in a single dataset (with different column names), but keeping the languages separate will make it easier to convince ourselves in the following examples that the languages have not been mixed up!

2.3. The Transformer Model

After completing the tokenization of the raw texts, we are ready to apply the transformer model, in our case the multilingual DistilBERT model.

First, we load the model. To speed up the following calculations, we opt for GPU support if available.

The warning message can be ignored for our application.

Let's examine the model structure:

As we can see, the first block of the model deals with embeddings, with the word embedding as the first layer. This is followed by the transformer which consists of 6 transformer blocks.

Let's first explore the word embedding.

The goal of the word embedding layer is to assign each element of the vocabulary a vector of length $E$.

The multilingual DistilBERT model has a vocabulary of size $V=119'547$ and a word embedding size of $E=768$. We can confirm this by looking at the dimension of the word embedding weight tensor:

To see the outputs of the transformer encoder, let's apply the transformer to the first record of the dataset, more precisely to its columns input_ids and attention_mask, the outputs of the tokenizer:

This produces a BaseModelOutput object which has a named property last_hidden_state, a tensor that represents the hidden state of the final transformer block, i.e. the encoded text sequence!

The dimension of the last hidden state is:

i.e., [number of samples (1), sequence length $T$ (maximum 512 tokens), embedding size $E$ (768)].

In what follows, we will use the information contained in this tensor to make predictions.

3. Using Transformers to Extract Features for Classification or Regression Tasks

In this section you will learn how transformers can be used to extract features from text data for a classification or regression problem.

The idea is simple: The tokenized raw text data is encoded by the transformer model, and the features are extracted from the last hidden state.

3.1. Extracting the Encoded Text

We have seen above that the DistilBERT model encodes each token of each input sample into a tensor of length $E=768$. As such, the output of the transformer model depends on the length of the input sequences. To make predictions, we would prefer having a single vector per input sample, independent of the sequence length.

Different approaches are available to achieve this goal:

We will implement both techniques and compare results.

In the following cell we display a short function which applies the NLP model to a batch of encoded input samples, extracts the last hidden state, and returns two tensors of length 768 for each input sample, corresponding to the two methods explained before.

The cell is not executable, because the function is already defined in the module tutorial_utils we imported initially.

```
def extract_sequence_encoding(batch, model):
    input_ids = torch.tensor(batch["input_ids"]).to(model.device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(model.device)

    with torch.no_grad():
        # apply transformer model and extract last hidden state
        model_output = model(input_ids, attention_mask)
        last_hidden_state = model_output.last_hidden_state

    # extract the tensor corresponding to the CLS token,
    # i.e. the first element in the encoded sequence
    batch["cls_hidden_state"] = last_hidden_state[:, 0, :].cpu().numpy()

    # mean pooling: take average over input sequence, but mask
    # sequence elements corresponding to the PAD token
    last_hidden_state = last_hidden_state.cpu().numpy()
    lhs_shape = last_hidden_state.shape
    boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
    boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
    boolean_mask = boolean_mask.reshape(lhs_shape)
    masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
    batch["mean_hidden_state"] = masked_mean.data

    return batch
```

Let's apply this function to the first sample of the training data:

As desired, two additional columns cls_hidden_state and mean_hidden_state were appended.

Therefore, the function can be supplied to the familiar map function to add corresponding columns to the original dataset. The following lines do this for the full datasets.

On an AWS EC2 p2.xlarge instance, the run time is more than 10 minutes. We save the resulting datasets to disk.

3.2. ... and Using It in a Classification Model

We will now use the encoded texts as features to predict labels taken from certain tabular information available in the dataset.

To this end, we use the following convenience functions implemented in tutorial_utils.py:

Now the toolbox is ready!

Next, we apply it to a simple classification task.

3.3. Case Study: Use Accident Descriptions to Predict the Number of Vehicles Involved

In this case study, we will predict the number of vehicles involved in an accident from the verbal accident description.

Since the data set contains the column NUMTOTV, we can adopt a supervised learning approach.

We might consider framing the problem as a regression task, e.g. using Poisson regression. However, looking at the frequency distribution of NUMTOTV, it appears unlikely that the Poisson distribution is a good reflection of reality. First, there are no accidents with zero vehicles involved - it takes at least one. So we might consider using a zero-truncated Poisson model. However, the empirical frequency distribution has low mass at high vehicle counts, so that this would not be a plausible model either.

Therefore, we frame the prediction task as multinomial classification. Given that only a small fraction of cases involves four or more vehicles, and to avoid a heavily imbalanced classification problem, we map these cases to an aggregated class "3+".

To achieve this, we map the column NUMTOTV to a new column labels, with levels 0 (1 vehicle), 1 (2 vehicles) and 2 (3 or more vehicles). We choose the column name labels because this is expected by the sequence classification model which we fit in Section 4.2.
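One way this mapping might be written (the notebook's own implementation may differ):

```python
import pandas as pd

df = pd.DataFrame({"NUMTOTV": [1, 2, 2, 3, 5]})

# 1 vehicle -> label 0, 2 vehicles -> label 1, 3 or more -> label 2 ("3+")
df["labels"] = df["NUMTOTV"].clip(upper=3) - 1
print(df["labels"].tolist())  # [0, 1, 1, 2, 2]
```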

As explained in Section 3.1, we will explore two different ways to use encoded texts:

  1. Use the hidden state corresponding to the CLS token, which is the first token of the input sequence in BERT models.
  2. Mean pooling: Taking the average of the tensors over all elements of the sequence.

Let's start with the first approach by using the feature cls_hidden_state produced in Section 3.1.

Using the toolbox developed before we fit a dummy classifier and a logistic regression classifier to the features and labels of the English dataset.
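The notebook's convenience functions live in tutorial_utils; a stand-alone sketch of the same idea with scikit-learn, run here on random stand-in features (so the scores themselves are meaningless), might look like:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# random stand-in for the 768-dimensional cls_hidden_state features
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 3, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 3, 50)

# baseline: always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy accuracy:", dummy.score(X_test, y_test))

# logistic regression on the extracted features
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logreg accuracy:", clf.score(X_test, y_test))
```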

We obtain an accuracy score of 91%, compared to 57% with the dummy classifier. This is already a very good result!

Remember, we have just used the DistilBERT transformer off the shelf, with no tuning whatsoever, to extract a vector of length 768 representing the information contained in the accident descriptions. During this entire text encoding, the transformer model was unaware that its output was going to be used to predict the number of vehicles.

How about the second approach, which uses the feature mean_hidden_state that was extracted by mean pooling over the entire encoded sequence?

Let's see:

Again, we have used DistilBERT without any fine-tuning.

For the present task, by any of the considered scores, mean pooling performs much better than using the encoding of the CLS token. For this reason, we use mean pooling in what follows.

What would you guess - will the classifier model exhibit a similar performance when trained on the encoded German dataset?

Let's check:

Yes indeed, the performance on the English and German datasets are comparable. This is what we would have expected - after all we are using a multilingual transformer model.

3.4. Cross-Lingual Transfer

In practice, it might happen that training data is available (predominantly) in one language, but we would like to apply the model to test data in another language. Translating the test data to the language of the training data would be an option, but let's see how the multilingual transformer model performs.

In our small experiment, we simply switch the languages of the test sets. This might be hard for the models, since in the entire training process each model has seen only encoded input from text samples in one language!

First, use the German test set for the model trained on English input:

From these rather poor results, we conclude that this approach to cross-language transferability does not work.

Vice versa, use the English test set for the model based on German input:

Again, performance is unsatisfactory.

To improve results, we need to change the approach.

3.5. Multi-Lingual Training

In a multilingual situation, a possible approach is to train the classifier with a training set consisting of encoded samples from both languages. This can always be achieved by translating a fraction of the text data and then using it to train the model.

This is exactly what we are going to do next. In order to simulate a situation where one language is underrepresented, we create a mixed-language dataset with about 80% English and 20% German samples, our dataset dataset_mx produced in Section 2.2.

Since we are already using a multilingual transformer model, no further changes are required.

This is a very good outcome. The scores are close to those achieved in the single-language situation!

To conclude, a multi-lingual situation can be handled by a multi-lingual transformer model. For the best performance, the classifier should be trained on the encoded sequences from all languages.

4. Fine-Tuning – Improving the Model

In the previous case study, we have used the DistilBERT model without any adaptation to the text data at hand, simply by using the sequence encoding produced by the model. As such, the language representation, which the model has learned from a large corpus of multilingual data, is transferred to the text data at hand. This approach is called transfer learning. The advantage of transfer learning is that a powerful (but relatively complex) model can be trained on a large corpus of data, using large-scale computing power, and then be applied to situations where availability of data or computing power would not allow for such complex models.

For the task at hand, the results are already very good. However, in certain situations it might be required to further improve model performance.

In the following sections you will learn how to fine-tune a transformer model. We will explore two approaches to fine-tuning:

The advantage of the first approach is that it can be performed in an unsupervised fashion, i.e., it does not require labeled data.

On the other hand, task-specific fine-tuning is expected to produce better performance on the particular task which the model was tuned for, so it might be the method of choice if there is a single down-stream task and sufficient labeled data.

Let's explore these two fine-tuning approaches in turn.

4.1. Domain-specific fine-tuning

Domain-specific fine-tuning can be achieved by applying the model to a "masked language modeling" task. This involves taking a sentence, randomly masking a certain percentage of the words in the input, and then running the entire masked sentence through the model, which has to predict the masked words. This self-supervised approach generates inputs and labels automatically from the texts and does not require any human labeling.
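The Transformers library implements this masking in DataCollatorForLanguageModeling; purely as an illustration, the underlying BERT recipe (select ~15% of positions; of these, replace 80% by the mask token, 10% by a random token, and leave 10% unchanged) can be sketched in NumPy as follows. The function name and its arguments are hypothetical:

```python
import numpy as np

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style masking sketch. Labels hold the original id at the
    selected positions and -100 elsewhere, so the loss ignores
    unselected tokens."""
    rng = np.random.default_rng(seed)
    ids = np.array(input_ids)
    labels = np.full_like(ids, -100)

    selected = rng.random(ids.shape) < mlm_prob   # ~15% of positions
    labels[selected] = ids[selected]

    r = rng.random(ids.shape)
    ids[selected & (r < 0.8)] = mask_id           # 80% -> mask token
    random_pos = selected & (r >= 0.8) & (r < 0.9)  # 10% -> random token
    ids[random_pos] = rng.integers(0, vocab_size, random_pos.sum())
    return ids, labels                            # remaining 10% unchanged

ids, labels = mask_tokens(list(range(100, 200)), mask_id=103, vocab_size=1000)
```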

This is very easy to implement using the Transformers library. You will see three new elements of the Transformers library in action:

Depending on the hardware available, training might take a rather long time. Therefore, if available, we use GPU support. On an AWS EC2 p2.xlarge instance, the run time is about 55 minutes. We store the trained model for later use.

If you do not have enough time to perform this step right now, you can skip this section and return later. The remainder of this notebook does not depend on it.

Now, model_mlm holds the DistilBERT model, fine-tuned to the mixed-language accident descriptions using masked-language-modeling.

Next, we apply this model to all input sequences and extract the last hidden state. The procedure is the same as in Section 3.1. To avoid confusion, we create new datasets and store them on disk for later use, so that this step does not need to be repeated each time this notebook is re-run.

Now let's see to what extent domain-specific fine-tuning is able to improve the performance of the classification model.

To this end, we perform the same steps as in Sections 3.3-3.5:

By comparing to the above results, we observe that the domain-specific fine-tuning on the English training set has improved the scores, but not to a satisfactory level for the cross-language transfer cases.

4.2. Task-specific fine-tuning

An alternative to domain-specific fine-tuning is task-specific fine-tuning.

The idea is to train a transformer model directly on the task at hand, in our case a sequence classification task. The process is very similar to the masked language modeling used for domain-specific pre-training, except that we load a sequence classification model using the class AutoModelForSequenceClassification.

The following code tunes a sequence classification model that uses the English accident descriptions to predict the number of vehicles involved. On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.

The scores on the English test set have improved to fantastic levels.

What is even more impressive is the performance on cross-lingual transfer: Despite the fact that the model has been trained on English texts only, its performance scores on the German test set are very good.

This is an excellent result!

5. Understand Prediction Errors and Interpret Predictions

In this section you will learn how to analyze prediction errors and how to interpret predictions.

We will study a more challenging example.

5.1 Case Study: Use Accident Descriptions to Identify Bodily Injury

As seen in the previous section, predicting the number of vehicles from the available accident descriptions is a relatively easy task for the transformer model, even in a multi-lingual situation.

Therefore, we will turn to a somewhat more difficult task: identifying cases which lead to bodily injuries. We use the column INJSEVB as label.

The process is identical to the previous case study:

In case you have skipped Section 4.1 Domain-specific finetuning, the dataset ../datasets/dataset_en_pretrained will not be available. In this case simply comment out the last lines of each block below.

In case you have skipped Section 4.1 Domain-specific finetuning, please also skip the following cell.

We observe the following:

Next, we perform task-specific fine-tuning. On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.

We observe the following:

5.2. Investigate False Positives and False Negatives

To investigate the prediction errors, we export the predictions into an Excel file with the following columns:

column meaning
SCASEID unique identification number of the case
SUMMARY_EN description of the accident, in English
SUMMARY_TRUNCATED description of the accident, in English, truncated to a length of 512 tokens
INJSEVA most serious injury sustained in the case, as per Police Accident Report
labels indicator of bodily injury INJSEVB (true label)
pred predicted label
0 probability of negative label
1 probability of positive label

The first step of the error analysis is to inspect the samples producing false negative and false positive predictions. Reading every single text would be very tedious, therefore it is worthwhile focusing on those examples where the probability assigned to the false prediction was high, i.e., cases where the model was confident but wrong.

Looking at the false negatives, we observe that there are many cases where the model assigns a high probability to negative. We suspect that truncation is responsible for many of the false negatives – the relevant part of the text was discarded.

To address this issue, we split the text into slightly overlapping chunks, run the prediction on each chunk and apply the logical OR-function to the results. We implement this functionality in a simple function that returns an additional column pred, containing a list of predicted labels, with one element for each chunk.
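The chunking idea can be sketched as follows; chunk_token_ids and predict_long_text are hypothetical names, and predict_fn stands in for any per-chunk classifier:

```python
def chunk_token_ids(ids, chunk_len=512, overlap=50):
    """Split a token-ID sequence into chunks of length chunk_len that
    overlap by `overlap` tokens, so content near a chunk boundary is
    seen in full by at least one chunk."""
    step = chunk_len - overlap
    return [ids[i:i + chunk_len]
            for i in range(0, max(len(ids) - overlap, 1), step)]

def predict_long_text(ids, predict_fn, **kwargs):
    """Run predict_fn on each chunk; the text is flagged positive
    (label 1) if ANY chunk is predicted positive (logical OR)."""
    preds = [predict_fn(chunk) for chunk in chunk_token_ids(ids, **kwargs)]
    return int(any(preds)), preds

# toy check: the "relevant" token 999 lies in the last chunk only
label, per_chunk = predict_long_text(list(range(1000)), lambda c: 999 in c)
print(label, per_chunk)  # 1 [False, False, True]
```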

The number of false negatives has decreased significantly, as expected, and the accuracy score has improved. Since we have not implemented a logic to combine the predicted probabilities of the different chunks, the log loss and Brier loss cannot be evaluated in this case.

5.3. Use Captum and transformers-interpret to Interpret Predictions

Transformer models are quite complex, and therefore, interpreting model output can be difficult.

Our main interest is in knowing which parts of the input text cause the classifier to arrive at a particular prediction. One way to answer this question is the so-called integrated gradients method. It is conveniently provided by the library transformers_interpret, an interface to Captum, an open-source, extensible library for model interpretability built on PyTorch.

With just a few lines of code, we can run this on individual examples, and receive a graphical output as shown below. Of course, the output is also available in numerical form. We run this on CPU because on the AWS p2.xlarge instance, the GPU ran out of memory.

6. Using Extractive Question Answering to Process Longer Texts

In this section we use extractive question answering to extract parts of the accident description which indicate the presence of bodily injury. The aim is to reduce the length of the input texts by extracting only the relevant parts.

The easiest implementation of extractive question answering is provided by the pipeline abstraction.

We use deutsche-telekom/bert-multi-english-german-squad2, a multilingual English/German question answering model built on bert-base-multilingual-cased. By specifying device=0 we use GPU support.

We visit each accident report in turn (the context), and ask the model the two questions “Was someone injured?” and “Was someone transported?”. Since the accident reports might provide information on multiple persons, we allow a maximum of four candidate answers for each of the questions, which we concatenate into a single (much shorter) new text.

To achieve this, we write a short function which applies a question answering pipeline to an input text x. The argument questions is a list of questions.
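A sketch of such a function; extract_answers is a hypothetical name, and qa_pipe is assumed to behave like a Transformers pipeline("question-answering"), which returns a list of answer dicts when called with top_k > 1:

```python
def extract_answers(x, qa_pipe, questions, top_k=4):
    """Ask each question against the text x and concatenate the
    extracted answer spans into one (much shorter) text."""
    answers = []
    for question in questions:
        for cand in qa_pipe(question=question, context=x, top_k=top_k):
            answers.append(cand["answer"])
    return " ".join(answers)

# offline check with a stub standing in for the real pipeline
def stub_pipe(question, context, top_k):
    return [{"answer": "the driver"}, {"answer": "was injured"}][:top_k]

print(extract_answers("The driver was injured.", stub_pipe,
                      ["Was someone injured?"], top_k=2))
# the driver was injured
```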

We apply the question answering function to the entire test set.

On an AWS EC2 p2.xlarge instance, the run time is about 6 minutes. If you want to try the concept on only the first 250 samples, you can use ds_test = dataset["test"].select(range(250)).map(...

Next, we tokenize the extracted texts and define the labels, and store the dataset for later use:

We load the transformer model that was trained on the classification task...

...apply it to the tokenized text extracts and evaluate the predictions.

The performance is comparable with the logistic regression classifier on mean-pooled encodings of the original texts. On the other hand, there is a larger number of false negatives than obtained by task-specific training and evaluation on the full-length sequences. This indicates that in some cases the extractive question answering has missed or suppressed certain relevant parts. For instance, if the original text reads "The driver was injured.", the extract "The driver" is a correct answer to the question "Was someone injured?"; however, it is too short to detect the presence of an injury from the extract.

7. Conclusion

Congratulations!

In this notebook, you have learned how to apply transformer-based models to classification tasks that often arise in actuarial applications.

You have seen how to address challenges that often arise in practical applications:

a. The text corpus may be highly domain-specific, i.e., it may use specialized terminology. – In Section 4.1 we have applied domain-specific fine-tuning to improve model performance in a specific domain.

b. Multiple languages might be present in parallel. – In Section 3.5 we have used a multi-lingual transformer model to encode multi-lingual texts and to use this output for a classification task. Performance was good even when one language was underrepresented.

c. Text sequences might be short and ambiguous. Or they might be so long that it is hard to identify the parts relevant to the task. – In this tutorial we have demonstrated two approaches to deal with long texts:

d. The amount of training data may be relatively small. In particular, gathering large amounts of labelled data (i.e., text sequences augmented with a target label) might be expensive. – Throughout this workbook, we have used transformer models which have been trained on a large corpus of text data. We have applied these models to the specific task with no or little specific training, thus transferring the language understanding skills to the task at hand.

e. It is important to understand why a model arrives at a particular prediction. – In Section 5.3 we have shown how to visualize which parts of the input text cause the classifier to arrive at a particular prediction.

The notebook Part II deals with another dataset that has only short text descriptions. It demonstrates possible approaches in case no or few labels are available.